Lexer

  • In classical compiler design, a lexer’s job is to convert a flat character stream into a flat token stream. It does not understand nesting or structure beyond very local patterns.

  • In the standard model, the lexer is intentionally simple, and understanding nesting is the parser’s job.

  • From a theoretical lexer design standpoint, the key rule is:

  • A lexer should encode only what is unambiguous at the lexical level, and nothing that depends on grammar or semantics.

  • A lexer must never build language constructs whose shape depends on syntax or meaning, even if the meaning is “known”.

    • For example, take Vector2(20, 30): knowing that Vector2 is “just two f32s” is semantic knowledge, not lexical knowledge.
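
A minimal sketch of that boundary, in Rust (the Token type and its variant names are illustrative, not from any particular codebase): every variant is recognizable from the character stream alone.

```rust
// Lexical categories only: each variant is decidable from local
// character patterns, with no grammar or semantics involved.
#[derive(Debug, PartialEq)]
enum Token {
    Ident(String), // "Vector2" stays an uninterpreted name
    Number(f64),   // a numeric literal; choosing f32 vs i64 is a later phase
    Str(String),   // quoted string literal
    LParen,
    RParen,
    LBracket,
    RBracket,
    Comma,
    Equal,
}

// By contrast, a variant like `Vec2Literal(f32, f32)` would break the
// rule above: "two f32s" is semantic knowledge, invisible to the lexer.
```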

Examples

Example 1
  • [node name="flor" parent="." unique_id=2138173886 instance=ExtResource("3_7sc02")]

LBRACKET
IDENT(node)
IDENT(name)
EQUAL
STRING("flor")
IDENT(parent)
EQUAL
STRING(".")
IDENT(unique_id)
EQUAL
NUMBER(2138173886)
IDENT(instance)
EQUAL
IDENT(ExtResource)
LPAREN
STRING("3_7sc02")
RPAREN
RBRACKET
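
The stream above can be produced by a straightforward character-by-character scan. A minimal sketch in Rust, reusing the Token shape from the earlier sketch (lex is an illustrative name; string escapes and negative numbers are deliberately out of scope):

```rust
// A flat lexer: local patterns in, flat token stream out.
#[derive(Debug, PartialEq)]
enum Token {
    Ident(String),
    Number(f64),
    Str(String),
    LParen, RParen, LBracket, RBracket, Comma, Equal,
}

fn lex(src: &str) -> Result<Vec<Token>, String> {
    let mut tokens = Vec::new();
    let mut chars = src.chars().peekable();
    while let Some(&c) = chars.peek() {
        match c {
            _ if c.is_whitespace() => { chars.next(); }
            '(' => { chars.next(); tokens.push(Token::LParen); }
            ')' => { chars.next(); tokens.push(Token::RParen); }
            '[' => { chars.next(); tokens.push(Token::LBracket); }
            ']' => { chars.next(); tokens.push(Token::RBracket); }
            ',' => { chars.next(); tokens.push(Token::Comma); }
            '=' => { chars.next(); tokens.push(Token::Equal); }
            '"' => {
                chars.next(); // consume the opening quote
                let mut s = String::new();
                loop {
                    match chars.next() {
                        Some('"') => break,
                        Some(ch) => s.push(ch),
                        None => return Err("unterminated string".into()),
                    }
                }
                tokens.push(Token::Str(s));
            }
            _ if c.is_ascii_digit() => {
                let mut n = String::new();
                while let Some(&d) = chars.peek() {
                    if d.is_ascii_digit() || d == '.' { n.push(d); chars.next(); } else { break; }
                }
                tokens.push(Token::Number(n.parse().map_err(|e| format!("bad number: {e}"))?));
            }
            _ if c.is_alphabetic() || c == '_' => {
                let mut id = String::new();
                while let Some(&a) = chars.peek() {
                    if a.is_alphanumeric() || a == '_' { id.push(a); chars.next(); } else { break; }
                }
                tokens.push(Token::Ident(id));
            }
            other => return Err(format!("unexpected character {other:?}")),
        }
    }
    Ok(tokens)
}

fn main() {
    let line = r#"[node name="flor" parent="." unique_id=2138173886 instance=ExtResource("3_7sc02")]"#;
    for token in lex(line).unwrap() {
        println!("{token:?}");
    }
}
```

Running it prints the same sequence, one token per line, under this sketch's variant names.
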
Example 2
  • Vector2(506, 323)

  • must be tokenized into multiple tokens, not one.

  • Each token independently fits your { type, value } model.

  • Exact theoretical token sequence:

  • Token

    • type: IDENTIFIER

    • value: "Vector2"

  • Token

    • type: LEFT_PAREN

    • value: "(" or None

  • Token

    • type: NUMBER_LITERAL

    • value: 506 (numeric value, not string)

  • Token

    • type: COMMA

    • value: "," or None

  • Token

    • type: NUMBER_LITERAL

    • value: 323 (numeric value, not string)

  • Token

    • type: RIGHT_PAREN

    • value: ")" or None

  • That is the full and correct lexical output.
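
Expressed in the { type, value } model, the same sequence looks like the following. A sketch in Rust, with illustrative TokenType, Value, and Token names (the field is spelled r#type only because type is a Rust keyword):

```rust
// The exact lexical output for `Vector2(506, 323)` in { type, value } form.
#[derive(Debug)]
enum TokenType { Identifier, LeftParen, NumberLiteral, Comma, RightParen }

#[derive(Debug)]
enum Value { None, Str(String), Num(f64) }

#[derive(Debug)]
struct Token { r#type: TokenType, value: Value }

fn vector2_tokens() -> Vec<Token> {
    vec![
        Token { r#type: TokenType::Identifier,    value: Value::Str("Vector2".into()) },
        Token { r#type: TokenType::LeftParen,     value: Value::None },
        Token { r#type: TokenType::NumberLiteral, value: Value::Num(506.0) }, // numeric, not string
        Token { r#type: TokenType::Comma,         value: Value::None },
        Token { r#type: TokenType::NumberLiteral, value: Value::Num(323.0) },
        Token { r#type: TokenType::RightParen,    value: Value::None },
    ]
}
```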

  • Parser responsibility (high level; see the sketch after this list):

    • The parser’s job is to:

      • Consume tokens according to grammar rules

      • Establish structure and relationships

      • Produce an AST, not values

    • The parser does not:

      • Decide what Vector2 means

      • Construct arrays

      • Convert to f32

      • Perform semantic validation
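
A minimal recursive-descent sketch of that division of labor, in Rust (Expr, parse_call, and the grammar rule call := IDENT "(" expr ("," expr)* ")" are illustrative assumptions): the parser consumes tokens according to the rule and produces structure, never values.

```rust
// The parser consumes the flat token stream according to a grammar
// rule and records structure; it attaches no meaning to "Vector2".
#[derive(Debug, Clone, PartialEq)]
enum Token { Ident(String), Number(f64), LParen, RParen, Comma }

#[derive(Debug, PartialEq)]
enum Expr {
    Number(f64),                               // literal as written, not yet typed
    Call { callee: String, args: Vec<Expr> },  // structure and relationships only
}

// call := IDENT "(" expr ("," expr)* ")"
fn parse_call(tokens: &[Token]) -> Result<Expr, String> {
    let mut i = 0;
    let callee = match tokens.get(i) {
        Some(Token::Ident(name)) => { i += 1; name.clone() }
        other => return Err(format!("expected identifier, got {other:?}")),
    };
    if tokens.get(i) != Some(&Token::LParen) {
        return Err("expected '('".into());
    }
    i += 1;
    let mut args = Vec::new();
    loop {
        // This sketch only supports number literals as arguments.
        match tokens.get(i) {
            Some(Token::Number(n)) => { args.push(Expr::Number(*n)); i += 1; }
            other => return Err(format!("expected expression, got {other:?}")),
        }
        match tokens.get(i) {
            Some(Token::Comma) => { i += 1; }
            Some(Token::RParen) => break,
            other => return Err(format!("expected ',' or ')', got {other:?}")),
        }
    }
    Ok(Expr::Call { callee, args })
}
```

For the Vector2(506, 323) token stream this yields Call { callee: "Vector2", args: [Number(506.0), Number(323.0)] }: an AST node, not an array of f32s.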

  • Where values are manipulated

    • Values are first legitimately created, evaluated, and manipulated during semantic analysis / constant evaluation, after parsing but before code generation.

    • Semantic analysis / constant evaluation / lowering

      • This phase may have different names, but conceptually it is where:

        • Symbols are resolved

        • Types are assigned

        • Expressions may be evaluated

        • Constants may be folded

        • Builtins may be lowered

        • IR-friendly representations are produced

      • This is the first phase allowed to manipulate actual values (sketched below).
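
A sketch of that first value-manipulating phase, in Rust (eval_const, Const, and the treatment of Vector2 as a constant-foldable builtin are illustrative assumptions, not a prescribed design):

```rust
// Semantic analysis / constant evaluation: the first phase that may
// turn AST nodes into actual values.
#[derive(Debug)]
enum Expr {
    Number(f64),
    Call { callee: String, args: Vec<Expr> },
}

// A lowered, IR-friendly constant. Only at this phase does
// "two f32s" exist anywhere in the pipeline.
#[derive(Debug, PartialEq)]
enum Const {
    F32(f32),
    Vec2([f32; 2]),
}

fn eval_const(expr: &Expr) -> Result<Const, String> {
    match expr {
        // Type assignment: the untyped literal becomes an f32 here.
        Expr::Number(n) => Ok(Const::F32(*n as f32)),
        // Builtin lowering: resolve the symbol, validate arity,
        // evaluate the arguments, and fold the constant.
        Expr::Call { callee, args } if callee.as_str() == "Vector2" => {
            if args.len() != 2 {
                return Err(format!("Vector2 expects 2 arguments, got {}", args.len()));
            }
            let mut parts = [0.0f32; 2];
            for (slot, arg) in parts.iter_mut().zip(args) {
                match eval_const(arg)? {
                    Const::F32(v) => *slot = v,
                    other => return Err(format!("expected scalar, got {other:?}")),
                }
            }
            Ok(Const::Vec2(parts))
        }
        Expr::Call { callee, .. } => Err(format!("unknown builtin {callee}")),
    }
}
```

Only here does “Vector2 is just two f32s” become executable knowledge: the builtin is resolved, arity is validated, the literals are assigned the f32 type, and the whole call folds into an IR-friendly constant.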